Task: Analyze public data about smart device use habits, and make recommendations for improving marketing strategies for the company.
The data is available through Kaggle, specifically the FitBit Fitness Tracker Data, which consists of 18 CSV files of various sizes. The largest file is 85.4 MB, which is too large for handling in a spreadsheet but perfectly fine for R. Therefore, I decided to use RStudio on my laptop for the data analysis.
Let’s load some helpful library packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
library(RColorBrewer)
All 18 CSV files could be read into R at once, but they are not all needed simultaneously. We will therefore read only a few specific files initially and load the others later, whenever they are needed.
For reference, all CSV files in a directory could be loaded in one go, using the file names as data frame names, in the following way:
#files <- list.files(pattern = "\\.csv$", full.names = TRUE)
#data_list <- map(files, read_csv)
#file_names <- tools::file_path_sans_ext(basename(files))
#names(data_list) <- file_names
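A runnable variant of the commented-out idea above might look like the following sketch. It assumes the readr and purrr packages; the folder `fitbit_demo` and the two tiny CSV files are hypothetical stand-ins, created in a temporary directory so that the example runs without the Kaggle data.

```r
library(readr)
library(purrr)

# Hypothetical demo folder with two tiny CSV files (not the real Kaggle data)
demo_dir <- file.path(tempdir(), "fitbit_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(Id = c(1, 2), Calories = c(1985, 1797)),
          file.path(demo_dir, "dailyCalories_demo.csv"))
write_csv(data.frame(Id = c(1, 2), StepTotal = c(13162, 10735)),
          file.path(demo_dir, "dailySteps_demo.csv"))

# List the CSV files and name each path after its file (without extension)
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
names(files) <- tools::file_path_sans_ext(basename(files))

# map() keeps the names, so the result is a named list of data frames
data_list <- map(files, function(f) read_csv(f, show_col_types = FALSE))

names(data_list)
```

Individual data frames are then available by name, for example `data_list$dailySteps_demo`.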
First, let us concentrate on the daily data files.
Let’s read the file contents into data frames and “glimpse” at them.
activity_daily <- read_csv("./dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calories_daily <- read_csv("dailyCalories_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
intensities_daily <- read_csv("dailyIntensities_merged.csv")
## Rows: 940 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (9): Id, SedentaryMinutes, LightlyActiveMinutes, FairlyActiveMinutes, Ve...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
steps_daily <- read_csv("dailySteps_merged.csv")
## Rows: 940 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDay
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sleep_daily <- read_csv("./sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(activity_daily)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(calories_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 1775…
glimpse(intensities_daily)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/…
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(steps_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366…
## $ ActivityDay <chr> "4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016", "4/16/…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 1054…
glimpse(sleep_daily)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ SleepDay <chr> "4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM", "…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
The dates are formatted as character strings. We will change them to date format. We also make sure that they are all called ActivityDate for uniformity.
calories_daily <- rename(calories_daily, ActivityDate = ActivityDay)
intensities_daily <- rename(intensities_daily, ActivityDate = ActivityDay)
steps_daily <- rename(steps_daily, ActivityDate = ActivityDay)
sleep_daily <- rename(sleep_daily, ActivityDate = SleepDay)
activity_daily$ActivityDate <- mdy(activity_daily$ActivityDate)
calories_daily$ActivityDate <- mdy(calories_daily$ActivityDate)
intensities_daily$ActivityDate <- mdy(intensities_daily$ActivityDate)
steps_daily$ActivityDate <- mdy(steps_daily$ActivityDate)
sleep_daily$ActivityDate <- date(mdy_hms(sleep_daily$ActivityDate))
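The two date formats in these files need different lubridate parsers. The sketch below illustrates this on single strings copied from the glimpse output above; `as_date()` is used here in place of `date()`, on the assumption that both give the same result for these timestamps.

```r
library(lubridate)

# Date-only string, as in the daily activity files
day_only <- mdy("4/12/2016")

# Date-time string with an AM/PM marker, as in the sleep file
with_time <- mdy_hms("4/12/2016 12:00:00 AM")

# Drop the time of day to obtain a plain Date
sleep_date <- as_date(with_time)

class(day_only)
sleep_date == day_only
```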
Let us look at the data frames again:
glimpse(activity_daily)
## Rows: 940
## Columns: 15
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
glimpse(calories_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 177…
glimpse(intensities_daily)
## Rows: 940
## Columns: 10
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
glimpse(steps_daily)
## Rows: 940
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 105…
glimpse(sleep_daily)
## Rows: 413
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
We know that the Ids represent users. Let’s check how many different users there are.
activity_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
calories_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
intensities_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
steps_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 33 1503960366 8877689391
sleep_daily %>% summarise(n_distinct(Id), min(Id), max(Id))
## # A tibble: 1 × 3
## `n_distinct(Id)` `min(Id)` `max(Id)`
## <int> <dbl> <dbl>
## 1 24 1503960366 8792009665
There are 33 distinct users in four of the data frames and 24 distinct users in the sleep_daily data frame. We must handle the sleep data carefully when integrating it with the other activity data because the user sets do not match.
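One way to see exactly which users lack sleep records is an anti-join on the Id column. The following is a minimal sketch on made-up toy data; `activity_demo` and `sleep_demo` are hypothetical stand-ins for `activity_daily` and `sleep_daily`.

```r
library(dplyr)

# Toy stand-ins for activity_daily and sleep_daily (made-up values)
activity_demo <- tibble(Id = c(1, 1, 2, 3),
                        TotalSteps = c(9000, 11000, 4000, 7000))
sleep_demo <- tibble(Id = c(1, 3),
                     TotalMinutesAsleep = c(420, 380))

# Users who appear in the activity data but never in the sleep data
users_without_sleep <- activity_demo %>%
  distinct(Id) %>%
  anti_join(distinct(sleep_demo, Id), by = "Id")

users_without_sleep
```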
Three different intensity levels are recorded for activities: Lightly Active Minutes, Fairly Active Minutes, and Very Active Minutes. Let’s first plot the calories burned against each of them, and then combine them into a new column called TotalActiveMinutes and plot the calories against that as well.
ggplot(data=activity_daily, aes(x=LightlyActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Lightly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=FairlyActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Fairly Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=VeryActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
activity_daily <- activity_daily %>%
mutate(TotalActiveMinutes = LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes)
ggplot(data=activity_daily, aes(x=TotalActiveMinutes, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Total Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
We must be cautious with the comparison because the scales on the axes are not identical. While total active minutes and calories burned correlate well, it is noteworthy that the smoothed curve reaches about 3,000 calories only at almost 500 minutes of total activity, which amounts to more than 8 hours. In contrast, the curve for Very Active Minutes reaches about 3,000 calories at only about 80 minutes of activity. This underscores the significance of intensive activities. To investigate this further, we will redraw the Calories vs. Total Active Minutes figure, adding a color gradient based on the proportion of very active minutes to total active minutes (VAMinProp = VeryActiveMinutes / TotalActiveMinutes).
activity_daily <- activity_daily %>%
mutate(VAMinProp = VeryActiveMinutes/TotalActiveMinutes)
ggplot(data=activity_daily, aes(x=TotalActiveMinutes, y=Calories)) +
geom_point(aes(colour = VAMinProp)) + scale_colour_gradient2() + geom_smooth() + labs(title="Calories vs. Total Active Minutes with Proportion of Very Active Minutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The above figure also shows that short but higher intensity activities have more significant calorie-burning effects. This is very important information for people with sedentary work and lifestyles who might have limited time for exercise.
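The proportion VAMinProp is undefined on any day with zero total active minutes, because 0/0 evaluates to NaN in R. If such days occur in the data, a guarded version of the calculation could look like the sketch below. The first row uses values from the glimpse output above; the second row is a made-up zero-activity day.

```r
library(dplyr)

demo <- tibble(
  VeryActiveMinutes  = c(25, 0),   # second row: hypothetical day without activity
  TotalActiveMinutes = c(366, 0)
)

# Compute the proportion only where the denominator is positive;
# otherwise store an explicit NA instead of NaN
demo <- demo %>%
  mutate(VAMinProp = if_else(TotalActiveMinutes > 0,
                             VeryActiveMinutes / TotalActiveMinutes,
                             NA_real_))

demo
```

Points with a missing colour value are then easy to filter out or count before plotting.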
Next, we turn our attention to steps and calories, which are in separate data frames. Let’s merge them by the IDs and dates, and then plot the calories against the number of steps taken.
calories_steps <- merge(calories_daily, steps_daily, by=c('Id', 'ActivityDate'))
glimpse(calories_steps)
## Rows: 940
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
## $ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 2035, 1786, 177…
## $ StepTotal <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019, 15506, 105…
ggplot(data=calories_steps, aes(x=StepTotal, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories Burnt vs. Total Steps")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
As before, and not surprisingly, the figure shows a good correlation between the calories burned and the total number of steps. The issue, again, is that it takes an average of 20,000 steps to burn 3,000 calories, which, from my personal experience, is quite a lot of steps. On the other hand, there is a large deviation in the calories burned, especially in the range of 10,000-15,000 steps. Let’s return to the activity intensities and see how the distances covered during the various intensity activities correlate with the calories burned.
ggplot(data=activity_daily, aes(x=LightActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Lightly Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=ModeratelyActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Moderately Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=VeryActiveDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Very Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data=activity_daily, aes(x=TotalDistance, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Calories vs. Total Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activity_daily$LightActiveDistance, activity_daily$Calories)
## [1] 0.4669168
cor(activity_daily$ModeratelyActiveDistance, activity_daily$Calories)
## [1] 0.2167899
cor(activity_daily$VeryActiveDistance, activity_daily$Calories)
## [1] 0.4919586
cor(activity_daily$TotalDistance, activity_daily$Calories)
## [1] 0.6449619
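Only the total distance shows a reasonably strong correlation with calories (0.645); the correlations for the individual intensity levels are weaker (between 0.217 and 0.492).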
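The four separate cor() calls can also be written as a single summarise() step. The sketch below assumes dplyr's across() and uses a made-up toy data frame in place of `activity_daily`, so the resulting numbers are illustrative only.

```r
library(dplyr)

# Toy data (made-up values); TotalDistance is exactly linear in Calories
demo <- tibble(
  Calories           = c(1800, 2000, 2200, 2400),
  TotalDistance      = c(2, 4, 6, 8),
  VeryActiveDistance = c(0.5, 0.2, 1.5, 1.0)
)

# Correlate each distance column with Calories in one pass
cors <- demo %>%
  summarise(across(c(TotalDistance, VeryActiveDistance),
                   ~ cor(.x, Calories)))

cors
```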
Let’s see if it makes any difference when we take into account the proportion of the distance covered during very intense activities with respect to the total distance (VADistProp).
activity_daily <- activity_daily %>%
mutate(VADistProp = VeryActiveDistance/TotalDistance)
ggplot(data=activity_daily, aes(x=TotalDistance, y=Calories)) +
geom_point(aes(colour = VADistProp)) + scale_colour_gradient2() + geom_smooth() + labs(title="Calories vs. Total Distance with Proportion of Very Active Distance")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The figure does not demonstrate that covering a small distance at high intensity burns more calories than covering a long distance at low intensity. This could be because high-intensity activities sometimes result in covering a smaller distance. In other words, someone who covers a total distance of 5 miles might have done most of this distance through moderate or light activities (hence the white color) but can still burn a lot of calories with high-intensity stationary activities. However, the figure does show that longer distances were mostly covered during very intensive activities.
Next, we turn our attention to the sleep_daily data frame. We compare TotalMinutesAsleep to TotalTimeInBed to see how the length of sleep is related to the length of time spent in bed.
ggplot(data=sleep_daily, aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep Time vs. Bed Time")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The bulk of the data indicates that the difference between how much time people sleep and how much time they spend in bed is less than an hour. This is valuable information, suggesting that if we want to sleep more, we simply need to spend more time in bed. It is somewhat concerning, however, that there are several data points showing people who sleep less than 4 hours or more than 10 hours, or who stay in bed for more than 14 hours. Let’s remove those data points as outliers.
sleep_daily %>% filter(TotalMinutesAsleep > 240 & TotalMinutesAsleep < 600 & TotalTimeInBed <840) %>%
ggplot(aes(x=TotalTimeInBed, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep Time vs. Bed Time with outliers removed")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
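The gap between time in bed and time asleep can also be computed directly rather than read off the plot. The sketch below uses only the first six rows of sleep_daily as they appear in the glimpse output above; `MinutesAwakeInBed` is a hypothetical column name introduced for illustration.

```r
library(dplyr)

# First six rows of sleep_daily as shown in the glimpse() output above
sleep_demo <- tibble(
  TotalMinutesAsleep = c(327, 384, 412, 340, 700, 304),
  TotalTimeInBed     = c(346, 407, 442, 367, 712, 320)
)

# Minutes spent in bed without sleeping
sleep_demo <- sleep_demo %>%
  mutate(MinutesAwakeInBed = TotalTimeInBed - TotalMinutesAsleep)

# Share of records where the gap is under one hour
share_under_hour <- mean(sleep_demo$MinutesAwakeInBed < 60)

sleep_demo
share_under_hour
```

Running the same two steps on the full sleep_daily data frame would quantify how much of the data actually falls below the one-hour gap.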
We merge the activity_daily and sleep_daily data frames using an inner join on Id and ActivityDate, which is the default behavior of merge(). Remember that there are 33 distinct people in the activity data frame but only 24 in the sleep data frame.
activities_sleep <- merge(activity_daily, sleep_daily, by=c('Id', 'ActivityDate'))
glimpse(activities_sleep)
## Rows: 413
## Columns: 21
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityDate <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ TotalSteps <dbl> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ TotalDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ TrackerDistance <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ LightActiveDistance <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes <dbl> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ FairlyActiveMinutes <dbl> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ LightlyActiveMinutes <dbl> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ SedentaryMinutes <dbl> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ Calories <dbl> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ TotalActiveMinutes <dbl> 366, 257, 272, 267, 222, 345, 245, 238, 324, …
## $ VAMinProp <dbl> 0.06830601, 0.08171206, 0.10661765, 0.1348314…
## $ VADistProp <dbl> 0.2211765, 0.2252511, 0.3407643, 0.3321079, 0…
## $ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
activities_sleep %>% summarise(n_distinct(Id), min(Id), max(Id))
## n_distinct(Id) min(Id) max(Id)
## 1 24 1503960366 8792009665
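The effect of the inner join can be contrasted with a left join, which would keep the activity rows that have no sleep record. This is a minimal sketch with dplyr's join functions on made-up toy data, not the merge() call used above.

```r
library(dplyr)

# Toy stand-ins (made-up values): three activity days, one sleep record
activity_demo <- tibble(
  Id           = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  Calories     = c(1985, 1797, 2100)
)
sleep_demo <- tibble(
  Id                 = 1,
  ActivityDate       = as.Date("2016-04-12"),
  TotalMinutesAsleep = 327
)

# Inner join: only days present in both data frames survive
inner <- inner_join(activity_demo, sleep_demo, by = c("Id", "ActivityDate"))

# Left join: all activity days are kept, missing sleep values become NA
left <- left_join(activity_demo, sleep_demo, by = c("Id", "ActivityDate"))

nrow(inner)
nrow(left)
```

Which variant is appropriate depends on the question: correlations with sleep need the inner join, while counts of days without sleep tracking need the left join.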
Let’s see whether burning more calories goes along with more sleep by plotting the Calories column against the TotalMinutesAsleep column.
ggplot(data=activities_sleep, aes(x=TotalMinutesAsleep, y=Calories)) +
geom_point() + geom_smooth() + labs(title="Sleep vs. Activities")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep$Calories,activities_sleep$TotalMinutesAsleep)
## [1] -0.02852571
With a correlation coefficient of about -0.03, sleep time and the amount of calories burned are essentially uncorrelated.
Let us look at the connection between sedentary minutes and total minutes asleep.
ggplot(data=activities_sleep, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() + labs(title="Sleep vs. SedentaryMinutes")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep$SedentaryMinutes,activities_sleep$TotalMinutesAsleep)
## [1] -0.599394
The above figure suggests that there is a negative correlation between sedentary minutes and total minutes asleep, but only for more than 8 hours (480 minutes) of sedentary time. This is interesting data that could be investigated further, and an app could consider it, possibly issuing warnings that neither very low nor very high sedentary time is beneficial for sleep.
Let’s examine only the data with more than 8 hours (480 minutes) of sedentary time.
activities_sleep_filtered <- activities_sleep %>% filter(SedentaryMinutes > 480)
ggplot(activities_sleep_filtered, aes(x=SedentaryMinutes, y=TotalMinutesAsleep)) +
geom_point() + geom_smooth() +
labs(title="Sleep vs. SedentaryMinutes for SedentaryMinutes > 480")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
cor(activities_sleep_filtered$SedentaryMinutes, activities_sleep_filtered$TotalMinutesAsleep)
## [1] -0.6809041
Restricted to records with more than 8 hours (480 minutes) of sedentary time, the data shows a stronger negative correlation between sedentary time and sleep time (-0.68 compared with -0.60 for the full data).
Let us load and look at the weightLogInfo_merged file:
weightLog <- read_csv("./weightLogInfo_merged.csv")
## Rows: 67 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
## lgl (1): IsManualReport
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(weightLog)
## Rows: 67
## Columns: 8
## $ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ Date <chr> "5/2/2016 11:59:59 PM", "5/3/2016 11:59:59 PM", "4/13/2…
## $ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
weightLog %>% select(BMI) %>% summary()
## BMI
## Min. :21.45
## 1st Qu.:23.96
## Median :24.39
## Mean :25.19
## 3rd Qu.:25.56
## Max. :47.54
n_distinct(weightLog$Id)
## [1] 8
Note that the mean BMI is 25.19, just above the commonly used overweight threshold of 25, which indicates slightly overweight individuals on average. In any case, the 67 total observations coming from 8 individuals represent a very small dataset, which is not sufficient to draw conclusions from.
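To look beyond the mean, the BMI values could be grouped into the standard adult categories (below 18.5, 18.5 to below 25, 25 to below 30, and 30 or more). The sketch below applies this to the first eight BMI values visible in the glimpse output above; the `category` column and its labels are introduced here for illustration.

```r
library(dplyr)

# First eight BMI values as shown in the glimpse() output above
bmi_demo <- tibble(BMI = c(22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25))

# Assign each log entry to a BMI category
bmi_demo <- bmi_demo %>%
  mutate(category = case_when(
    BMI < 18.5 ~ "underweight",
    BMI < 25   ~ "normal",
    BMI < 30   ~ "overweight",
    TRUE       ~ "obese"
  ))

count(bmi_demo, category)
```

Note that these are log entries, not individuals, so users who log their weight often would be counted several times.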
Let us load and look at the hourlyIntensities_merged file:
hourlyIntensities <- read_csv("./hourlyIntensities_merged.csv")
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hourlyIntensities)
## Rows: 22,099
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/1…
## $ TotalIntensity <dbl> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
The column ActivityHour contains both date and time together as a character string. We have to separate them in order to examine the times of the day when people are more active. Let’s create a new column called ActivityDay, and in the existing ActivityHour column we will store only the hour of the day.
hourlyIntensities <- hourlyIntensities %>%
mutate(ActivityDay = date(mdy_hms(ActivityHour)))
hourlyIntensities <- hourlyIntensities %>%
mutate(ActivityHour = hour(mdy_hms(ActivityHour)))
glimpse(hourlyIntensities)
## Rows: 22,099
## Columns: 5
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ TotalIntensity <dbl> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
## $ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
## $ ActivityDay <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, 2016…
Let’s create a data frame TothourlyIntensities that has activity intensities grouped by the hour of the day.
TothourlyIntensities <- hourlyIntensities %>%
group_by(ActivityHour) %>%
summarise(total_intensity = sum(TotalIntensity))
glimpse(TothourlyIntensities)
## Rows: 24
## Columns: 2
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ total_intensity <dbl> 1989, 1324, 974, 414, 590, 4614, 7235, 9993, 13656, 14…
Let’s see what time of the day people are the most active. Because the totals are already computed, we use a bar chart (geom_col) for this purpose.
ggplot(data=TothourlyIntensities, aes(x=ActivityHour, y=total_intensity)) + geom_col() +
labs(title="Total Intensity vs. Time of Day")
People are most active between 5 and 7 p.m., and also around noon. An app could remind users to start their exercise activities around these times.
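Summing intensities per hour depends on how many records each hour contributes. As a possible cross-check, the mean intensity per hour could be plotted instead. The sketch below uses a made-up toy data frame in place of `hourlyIntensities`; `mean_intensity` and `n_records` are hypothetical column names.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for hourlyIntensities (made-up values)
hourly_demo <- tibble(
  ActivityHour   = c(0, 0, 12, 12, 18, 18, 18),
  TotalIntensity = c(2, 0, 20, 30, 40, 50, 60)
)

# Average intensity and number of records for each hour of the day
mean_by_hour <- hourly_demo %>%
  group_by(ActivityHour) %>%
  summarise(mean_intensity = mean(TotalIntensity),
            n_records = n())

# Bar chart of the precomputed means
p <- ggplot(mean_by_hour, aes(x = ActivityHour, y = mean_intensity)) +
  geom_col() +
  labs(title = "Mean Intensity vs. Time of Day (toy data)")

mean_by_hour
```

If the number of records per hour is roughly equal, the sums and the means should show the same peaks.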
To confirm that the periods of most intensive activity correspond to the highest calorie burn, let’s examine when people burn the most calories using the hourlyCalories_merged.csv file:
hourlyCalories <- read_csv("./hourlyCalories_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(hourlyCalories)
## Rows: 22,099
## Columns: 3
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <chr> "4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", "4/12/20…
## $ Calories <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
As before, we separate the date and time of day from the ActivityHour column, then create a data frame named TothourlyCalories that sums the calories by the hour of the day. Finally, we use a bar chart to observe at what times of the day people burn the most calories.
hourlyCalories <- hourlyCalories %>%
mutate(ActivityDay = date(mdy_hms(ActivityHour)))
hourlyCalories <- hourlyCalories %>%
mutate(ActivityHour = hour(mdy_hms(ActivityHour)))
glimpse(hourlyCalories)
## Rows: 22,099
## Columns: 4
## $ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ Calories <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
## $ ActivityDay <date> 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-12, 2016-04-…
TothourlyCalories <- hourlyCalories %>%
group_by(ActivityHour) %>%
summarise(total_calories = sum(Calories))
glimpse(TothourlyCalories)
## Rows: 24
## Columns: 2
## $ ActivityHour <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ total_calories <dbl> 67066, 65464, 64551, 63013, 63620, 76152, 80994, 87959,…
ggplot(data=TothourlyCalories, aes(x=ActivityHour, y=total_calories)) + geom_col() +
labs(title="Total Calories Burnt vs. Time of Day")
The maxima of the calorie chart match the most active periods (5-7 p.m. and around noon).
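The visual agreement between the two charts could also be checked numerically by joining the two hourly summaries and correlating them. The sketch below uses only hours 0 to 7, that is, the values visible in the two glimpse outputs above, so it says nothing about the afternoon peaks; `intensity_demo` and `calories_demo` are hypothetical stand-ins for TothourlyIntensities and TothourlyCalories.

```r
library(dplyr)

# Hours 0-7 only: values copied from the glimpse() outputs above
intensity_demo <- tibble(
  ActivityHour    = 0:7,
  total_intensity = c(1989, 1324, 974, 414, 590, 4614, 7235, 9993)
)
calories_demo <- tibble(
  ActivityHour   = 0:7,
  total_calories = c(67066, 65464, 64551, 63013, 63620, 76152, 80994, 87959)
)

# Put both totals side by side, one row per hour
hourly_both <- inner_join(intensity_demo, calories_demo, by = "ActivityHour")

# Correlation between hourly intensity and hourly calories
r <- cor(hourly_both$total_intensity, hourly_both$total_calories)
r
```

Applied to all 24 hours of the real summaries, the same join would give a single number to support the comparison of the two charts.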
The recommendations for the company are to include the following features in their app: